Abstract: This paper presents a supervised machine learning 
approach that uses a decision tree learning algorithm to 
recognize Bengali noun-noun compounds as multiword 
expressions (MWEs) in a Bengali corpus. The proposed 
approach has two steps: (1) extraction of candidate MWEs 
using chunk information and various heuristic rules, and 
(2) training the machine learning algorithm to classify each 
candidate as an MWE or not. A variety of association 
measures are used as features for identifying MWEs. The 
proposed system is tested on a Bengali corpus for identifying 
noun-noun compound MWEs. 

Index Terms: noun-noun compound, multiword expression, 
association measure, decision tree. 

I. Introduction 

Multiword expressions (MWEs) identified in a text document 
are useful for many natural language processing (NLP) 
applications such as information retrieval, machine 
translation and word sense disambiguation. Frank Smadja 
(1993) defined MWEs as "recurrent combinations of words that 
co-occur more often than expected by chance" [1]. Timothy 
Baldwin et al. (2010) defined MWEs as lexical items that can 
be decomposed into multiple lexemes and display lexical, 
syntactic, semantic, pragmatic and/or statistical 
idiomaticity [2]. Successful NLP applications need to 
identify MWEs and treat them appropriately instead of 
using a simple list of MWEs. 

Jackendoff (1997) estimates that the number of MWEs in 
a native speaker's lexicon is of the same order of magnitude 
as the number of single words [3]. In WordNet 1.7 (Fellbaum, 
1999), for example, 41% of the entries are multiword [4]. 

MWEs can be broadly classified into lexicalized phrases 
and institutionalized phrases (Ivan A. Sag et al., 2002) [5]. In 
terms of semantics, compositionality is an important 
property of MWEs. Compositionality is the degree to which 
the features of the parts of an MWE combine to predict the 
features of the whole. According to the compositionality 
property, MWEs can take a variety of forms: complete 
compositionality (also known as institutionalized phrases, 
e.g. many thanks and the Bengali example (Rajya Sabha, state 
government)), partial compositionality (e.g. light house and 
the Bengali examples (shopping mall) and (aam admi, 
common people)), idiosyncratic compositionality (e.g. spill 
the beans, open secret, which are decomposable) and finally 
complete non-compositionality (e.g. hot dog, green card and 
the Bengali example (ubhoy sangkat, on the horns of a 
dilemma), which are non-decomposable). 

Compound nouns are a class of MWEs that is rapidly 
expanding due to the continuous addition of new terms for 
introducing new ideas. Compound nouns fall into both 
groups: lexicalized and institutionalized. Noun-noun 
compounds in English characteristically occur frequently, 
with high lexical and semantic variability (Takaaki Tanaka 
et al., 2003) [6]. Since compound nouns are rather productive 
and new compound nouns are created from day to day, it is 
impossible to exhaustively store all compound nouns in a 
dictionary. 

It is also common practice in Bengali literature to use 
noun-noun compounds as MWEs. New Bengali terms coined 
directly from English terms are also commonly used as MWEs 
in Bengali (e.g. the Bengali forms of dengue three, nano sim, 
village tourism and alert message). 

The main focus of our work is to develop a machine 
learning approach based on a set of statistical features for 
identifying Bengali noun-noun compounds. 

To date, not much comprehensive work has been done 
on Bengali multiword expression extraction. 

Previous work related to our proposed work is 
presented in Section II. The proposed MWE identification 
method is detailed in Section III. The evaluation and 
results are presented in Section IV, and conclusions are drawn 
in the last section. 

II. Related Work 

Multiword expression extraction methods can be broadly 
classified as: association measure based methods, deep 
linguistic based methods, machine learning based methods 
and hybrid methods. 

The earliest work on MWE extraction used statistical 
measures. An important advantage of statistical measures 
for extracting multiword expressions is that they are 
language independent. Frank Smadja (1993) developed a 
system called Xtract that uses the positional distribution and 
part-of-speech information of the words surrounding a word 
in a sentence to identify interesting word pairs [1]. Classical 
statistical hypothesis tests such as the Chi-square test, 
t-test, z-test and log-likelihood ratio (Ted Dunning, 1993) 
have also been employed to extract collocations [7]. Gerlof 
Bouma (2009) presented a method for collocation extraction 
that uses information theory based association measures 
such as mutual information and point-wise mutual 
information [8]. 

Wen Zhang et al. (2009) highlight the deficiencies of 
mutual information and suggest enhanced mutual 
information based association measures to overcome them 
[9]. The major deficiencies of classical mutual information, 
as they mention, are its poor capacity to measure the 
association of words with unsymmetrical co-occurrence and 
the adjustment of the threshold value. Anoop et al. (2008) 
[10] also used various statistical measures such as point-wise 
mutual information (K. Church et al., 1990) [11], 
log-likelihood, frequency of occurrence, closed form count 
(e.g., blackboard) and hyphenated count (e.g., black-board) 
for extraction of Hindi compound noun MWEs. Aswhini et al. 
(2004) used co-occurrence and a significance function to 
extract MWEs automatically in Bengali, focusing mainly on 
noun-verb MWEs [12]. Sandipan et al. (2006) [13] used the 
association measures salience (Adam Kilgarriff et al., 
2000) [14], mutual information and log likelihood for finding 
N-V collocations. Tanmoy (2010) used a linear combination 
of association measures, namely co-occurrence, Phi and the 
significance function, to obtain a linear ranking function 
for ranking Bengali noun-noun collocation candidates; 
MWEness is measured by the rank score assigned by the 
ranking function [15]. 

Statistical tools (e.g., the log likelihood ratio) may miss 
many commonly used MWEs that occur with low frequency. 
To overcome this problem, linguistic clues are also 
useful for multiword expression extraction. Scott Songlin 
Piao et al. (2005) [16] focus on a symbolic approach to 
multiword extraction that uses a large-scale semantically 
classified multiword expression template database and 
semantic field information assigned to MWEs by the USAS 
semantic tagger (Paul Rayson et al., 2004) [17]. R. Mahesh 
et al. (2011) [18] used a stepwise methodology that exploits 
linguistic knowledge such as replicating words (e.g., ruk ruk, 
"stop stop"), pairs of words (e.g., din-raat, "day night"), 
samaas (N+N, A+N), sandhi (joining or fusion of words) and 
Vaalaa morpheme constructs (e.g., jaane vaalaa, "about to 
go") for mining Hindi MWEs. A rule-based approach for 
identifying only reduplication in a Bengali corpus is 
presented in Tanmoy et al. (2010) [19]. A semantic clustering 
based approach for identifying bigram noun-noun MWEs 
from a medium-size Bengali corpus is presented in 
Tanmoy et al. (2011) [20]. The authors of that paper 
hypothesize that the greater the similarity between the two 
components of a bigram, the lower the probability that it is 
an MWE. The similarity between two components is measured 
based on the overlap between the synonym sets of the 
component words. 

Pavel Pecina (2008) used linear logistic regression, linear 
discriminant analysis (LDA) and neural networks separately 
on feature vectors consisting of 55 association measures for 
extracting MWEs [21]. M.C. Diaz-Galiano et al. (2004) 
applied Kohonen's learning vector quantization (LVQ) to 
integrate several statistical estimators in order to recognize 
MWEs [22]. Sriram Venkatapathy et al. (2005) [23] 
presented an approach to measure the relative 
compositionality of Hindi noun-verb MWEs using a maximum 
entropy model (MaxEnt). Kishorjit et al. (2011) used a 
conditional random field (CRF) for Manipuri MWE 
extraction [24]. 

Hybrid methods combine statistical, linguistic and/or 
machine learning methods. Maynard and Ananiadou (2000) 
integrated both linguistic and statistical information in their 
system called TRUCK for extracting multi-word terms [25]. 
Dias (2003) [26] developed a hybrid system for MWE 
extraction which integrates word statistics and linguistic 
information. Carlos Ramisch et al. (2010) [27] present a 
hybrid approach to multiword expression extraction that 
combines the strengths of different sources of information 
using a machine learning algorithm. Ivan A. Sag et al. (2002) 
[5] argued in favor of maintaining the right balance between 
symbolic and statistical approaches while developing a 
hybrid MWE extraction system. 

III. Proposed MWE Identification Method 

Our proposed MWE identification method has three 
major steps: preprocessing, candidate MWE extraction and 
MWE identification by classifying the candidate MWEs 
into two categories: positive (MWE) and negative 
(non-MWE). 

A. Preprocessing 

At the preprocessing step, unformatted documents are 
automatically segmented into a collection of sentences by 
checking for the dari (the Bengali full stop), the question 
mark (?) and the exclamation mark (!). Typographic or 
phonetic errors are not corrected automatically. The 
sentences are then submitted to the chunker one by one for 
processing. 
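
The segmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the dari is the Unicode character U+0964, and the names `TERMINATORS` and `split_sentences` are hypothetical.

```python
import re

# Sentence terminators: Bengali dari (assumed to be U+0964),
# question mark and exclamation mark.
TERMINATORS = re.compile("([\u0964?!])")


def split_sentences(text):
    """Split raw text into sentences, keeping each terminator with its sentence."""
    # With a capturing group, re.split alternates: text, terminator, text, ...
    parts = TERMINATORS.split(text)
    sentences = []
    for i in range(0, len(parts), 2):
        body = parts[i].strip()
        end = parts[i + 1] if i + 1 < len(parts) else ""
        if body:
            sentences.append(body + end)
    return sentences
```

Each returned sentence would then be passed to the chunker in turn. Text that ends without a terminator is kept as a final sentence rather than dropped.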

B. Candidate MWE Extraction 

The chunked sentences are processed to identify the 
multiword expression candidates. The candidates are 
primarily extracted using the following rule: 

A bigram of consecutive tokens within the same NP chunk 
is extracted from the chunked sentences if the tag of each 
token is NN, NNP or XC (NN: noun, NNP: proper noun, 
XC: compound (Akshar Bharati et al., 2006)) [28]. 

We observed that some potential multiword expressions 
are missed due to chunker errors. For example, one 
sentence is chunked as ((NP ... NN)) ((NP ... NN)) 
((NP ... NN , SYM)). In this example, the potential 
multiword expression candidate "BSA Cycle" cannot be 
detected using the above rule, since "BSA" and "Cycle" 
belong to different chunks. 

To identify more potential MWE candidates, we use the 
following heuristic rule: 

Bigram noun-noun compounds which are hyphenated, 
occur within single quotes or within parentheses (first 
brackets), or whose words are out of vocabulary (OOV) are 
also considered as potential candidates for MWE. 

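
The primary chunk-based rule can be sketched as below. The data layout (a list of chunk label and token list pairs) is a hypothetical stand-in for the chunker's real output, and the function name `extract_candidates` is an assumption; the heuristic rules for hyphenated, quoted, bracketed and OOV bigrams are not covered here.

```python
NOUN_TAGS = {"NN", "NNP", "XC"}


def extract_candidates(chunked_sentence):
    """Return noun-noun bigrams whose two tokens lie in the same NP chunk.

    chunked_sentence: list of (chunk_label, [(token, pos_tag), ...]) pairs.
    This layout is assumed for illustration only.
    """
    candidates = []
    for label, tokens in chunked_sentence:
        if label != "NP":
            continue
        # Slide over consecutive token pairs inside this chunk only.
        for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
            if t1 in NOUN_TAGS and t2 in NOUN_TAGS:
                candidates.append((w1, w2))
    return candidates
```

Because pairs are formed only inside one chunk, two nouns that the chunker places in separate NP chunks are never paired, which is the kind of miss the heuristic rules are meant to recover.
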
C. Features 

The association measures phi, point-wise mutual 
information (pmi), salience, log likelihood, Poisson Stirling, 
chi and t-score are used to calculate the scores of 
each candidate MWE. These association measures use 
various types of frequency statistics associated with the 
bigram. The frequency statistics used in computing the 
association measures are represented in a typical 
contingency table format (Satanjeev Banerjee et al., 2003) 
[29]. Table I shows a typical contingency table with the 
various types of frequencies associated with the bigram 
<word1, word2>. 

Table I. Contingency Table 

            word2     ~word2    Total
word1       n11       n12       n1p
~word1      n21       n22       n2p
Total       np1       np2       npp


The meanings of the entries in the contingency table are 
given below: 

n11 = number of times the bigram occurs (joint frequency). 
n12 = number of times word1 occurs in the first position of a 
bigram when word2 does not occur in the second position. 
n21 = number of times word2 occurs in the second position of 
a bigram when word1 does not occur in the first position. 
n22 = number of bigrams where word1 is not in the first 
position and word2 is not in the second position. 
n1p = the number of bigrams where the first word is word1, 
that is, n1p = n11 + n12. 
np1 = the number of bigrams where the second word is word2, 
that is, np1 = n11 + n21. 
n2p = the number of bigrams where the first word is not 
word1, that is, n2p = n21 + n22. 
np2 = the number of bigrams where the second word is not 
word2, that is, np2 = n12 + n22. 
npp = the total number of bigrams in the entire corpus. 

Using the frequency statistics given in the contingency table, 
the expected frequencies m11, m12, m21 and m22 are 
calculated as follows: 

m11 = (n1p * np1) / npp 
m12 = (n1p * np2) / npp 
m21 = (np1 * n2p) / npp 
m22 = (n2p * np2) / npp 

where: 

m11: Expected number of times both words in the bigram occur 
together if they are independent. 
m12: Expected number of times word1 in the bigram will occur 
in the first position when word2 does not occur in the second 
position, given that the words are independent. 
m21: Expected number of times word2 in the bigram will occur 
in the second position when word1 does not occur in the 
first position, given that the words are independent. 
m22: Expected number of times word1 will not occur in the 
first position and word2 will not occur in the second position, 
given that the words are independent. 
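
As an illustration of how these statistics could be gathered, the sketch below counts the observed cells and marginals for one bigram from a list of corpus bigrams and then derives the expected frequencies. The function names and the dictionary layout are assumptions made for this example.

```python
from collections import Counter


def contingency(bigrams, word1, word2):
    """Observed cells and marginals for the bigram <word1, word2>."""
    counts = Counter(bigrams)
    npp = len(bigrams)                      # total number of bigrams
    n11 = counts[(word1, word2)]            # joint frequency
    n1p = sum(c for (a, _), c in counts.items() if a == word1)
    np1 = sum(c for (_, b), c in counts.items() if b == word2)
    n12 = n1p - n11
    n21 = np1 - n11
    n22 = npp - n11 - n12 - n21
    return {"n11": n11, "n12": n12, "n21": n21, "n22": n22,
            "n1p": n1p, "np1": np1,
            "n2p": n21 + n22, "np2": n12 + n22, "npp": npp}


def expected(t):
    """Expected cell frequencies under independence of the two words."""
    npp = t["npp"]
    return {"m11": t["n1p"] * t["np1"] / npp,
            "m12": t["n1p"] * t["np2"] / npp,
            "m21": t["np1"] * t["n2p"] / npp,
            "m22": t["n2p"] * t["np2"] / npp}
```

The four expected values always sum to npp, which gives a quick sanity check on the counts.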

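The following association measures, which use the 
above-mentioned frequency statistics, are used in our 
experiment. 

Phi, Chi and T-score 

The Phi, Chi and T-score are calculated using the following 
equations: 

\[ \mathrm{phi} = \frac{n_{11} n_{22} - n_{12} n_{21}}{\sqrt{n_{1p}\, n_{p1}\, n_{p2}\, n_{2p}}} \tag{1} \]

\[ \mathrm{chi} = 2\left[ \left(\frac{n_{11}-m_{11}}{m_{11}}\right)^{2} + \left(\frac{n_{12}-m_{12}}{m_{12}}\right)^{2} + \left(\frac{n_{21}-m_{21}}{m_{21}}\right)^{2} + \left(\frac{n_{22}-m_{22}}{m_{22}}\right)^{2} \right] \tag{2} \]

\[ \mathrm{T\text{-}score} = \frac{n_{11} - m_{11}}{\sqrt{n_{11}}} \tag{3} \]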
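
A minimal sketch of these three measures is given below. It assumes that phi is the cross-product difference divided by the square root of the product of the four marginals, that chi is twice the sum of squared relative deviations between observed and expected cells, and that the t-score divides the deviation of n11 from m11 by the square root of n11. These forms are assumptions about equations (1) to (3), and the function name is hypothetical.

```python
import math


def phi_chi_tscore(n11, n12, n21, n22):
    """Return (phi, chi, t-score) for one 2x2 contingency table.

    Assumes n11 > 0 and that no marginal total is zero.
    """
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    m11 = n1p * np1 / npp
    m12 = n1p * np2 / npp
    m21 = np1 * n2p / npp
    m22 = n2p * np2 / npp

    phi = (n11 * n22 - n12 * n21) / math.sqrt(n1p * np1 * np2 * n2p)
    cells = [(n11, m11), (n12, m12), (n21, m21), (n22, m22)]
    chi = 2 * sum(((n - m) / m) ** 2 for n, m in cells)
    tscore = (n11 - m11) / math.sqrt(n11)
    return phi, chi, tscore
```

For a table with cells 2, 1, 1, 2 every expected value is 1.5, so phi is 3/9, chi is 8/9 and the t-score is 0.5 divided by the square root of 2.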



Log likelihood, Pmi, Salience and Poisson Stirling 

Log likelihood is calculated as: 

\[ \mathrm{LL} = 2\left( n_{11}\log\frac{n_{11}}{m_{11}} + n_{12}\log\frac{n_{12}}{m_{12}} + n_{21}\log\frac{n_{21}}{m_{21}} + n_{22}\log\frac{n_{22}}{m_{22}} \right) \tag{4} \]

Point-wise Mutual Information (pmi) is calculated as: 

\[ \mathrm{pmi} = \log\frac{n_{11}}{m_{11}} \tag{5} \]

The salience is defined as: 

\[ \mathrm{salience} = \log\frac{n_{11}}{m_{11}} \cdot \log(n_{11}) \tag{6} \]

The Poisson Stirling measure is calculated using the formula: 

\[ \mathrm{Poisson\text{-}Stirling} = n_{11}\left(\log\frac{n_{11}}{m_{11}} - 1\right) \tag{7} \]
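
These four measures can be sketched as follows. The example assumes natural logarithms (the base is not stated in the text), treats a zero observed cell as contributing nothing to the log likelihood, and uses hypothetical function names and dictionary keys.

```python
import math


def expected_counts(n11, n12, n21, n22):
    """Expected cell frequencies (m11, m12, m21, m22) under independence."""
    n1p, n2p = n11 + n12, n21 + n22
    np1, np2 = n11 + n21, n12 + n22
    npp = n1p + n2p
    return (n1p * np1 / npp, n1p * np2 / npp,
            np1 * n2p / npp, n2p * np2 / npp)


def association_scores(n11, n12, n21, n22):
    """Log likelihood, pmi, salience and Poisson Stirling for one bigram.

    Assumes n11 > 0, which holds for any extracted candidate.
    """
    observed = (n11, n12, n21, n22)
    exp = expected_counts(*observed)
    log_ratio = math.log(n11 / exp[0])       # log(n11 / m11)
    # Cells with n = 0 are skipped: n * log(n / m) tends to 0 as n -> 0.
    ll = 2 * sum(n * math.log(n / m) for n, m in zip(observed, exp) if n > 0)
    return {"log_likelihood": ll,
            "pmi": log_ratio,
            "salience": log_ratio * math.log(n11),
            "poisson_stirling": n11 * (log_ratio - 1)}
```

Note that pmi, salience and Poisson Stirling depend only on n11 and m11, while the log likelihood uses all four cells.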

Co-occurrence 

Co-occurrence is calculated using the following formula 
(Aswhini Agarwal et al., 2004) [12]: 

\[ co(w_1, w_2) = \sum_{s \in S(w_1, w_2)} e^{-d(s, w_1, w_2)} \tag{8} \]

Where: 

co(w1,w2) = co-occurrence between the words (after 
stemming). 
S(w1,w2) = set of all sentences where both w1 and w2 occur. 
d(s,w1,w2) = distance between w1 and w2 in a sentence s in 
terms of words. 
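
The co-occurrence score can be sketched as below. Two details are assumptions rather than statements from the text: the distance is taken as the difference in token positions (so adjacent words have distance 1), and when a word occurs more than once in a sentence the smallest distance is used. Sentences are assumed to be lists of already stemmed tokens.

```python
import math


def cooccurrence(sentences, w1, w2):
    """Sum of exp(-d) over all sentences that contain both w1 and w2."""
    total = 0.0
    for tokens in sentences:
        pos1 = [i for i, tok in enumerate(tokens) if tok == w1]
        pos2 = [i for i, tok in enumerate(tokens) if tok == w2]
        if not pos1 or not pos2:
            continue  # this sentence is not in S(w1, w2)
        # Assumed: use the closest pair of occurrences in the sentence.
        d = min(abs(i - j) for i in pos1 for j in pos2)
        total += math.exp(-d)
    return total
```

Since each sentence contributes an exponentially decaying term, pairs that usually appear side by side score much higher than pairs that merely share sentences.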

Significance Function 

The significance function (Aswhini Agarwal et al., 2004) [12] 
is defined as: 

\[ \mathrm{sig}_{w_1}(w_2) = \sigma\left[ k_1 \left( 1 - co(w_1, w_2) \cdot f_{w_1}(w_2) \right) \right] \cdot \sigma\left[ k_2 \, \frac{f_{w_1}(w_2)}{\lambda} - 1 \right] \tag{9} \]

\[ \mathrm{sig}(w_1, w_2) = \mathrm{sig}_{w_1}(w_2) \cdot \exp\left[ \frac{f_{w_1}(w_2)}{\max(f_{w_1}(w_2))} - 1 \right] \tag{10} \]

Where: 

sig_w1(w2) = significance of w2 with respect to w1. 
f_w1(w2) = number of w1 with which w2 has occurred. 
sig(w1,w2) = general significance of w1 and w2; lies between 
0 and 1. 
σ(x) = sigmoid function = exp(-x)/(1+exp(-x)). 
k1 and k2 define the stiffness of the sigmoid curve (for 
simplicity they are set to 5.0). 
λ is defined as the average number of noun-noun 
co-occurrences. 

D. MWE Identification using a Decision Tree 

Decision tree learning is a method for approximating 
discrete-valued target functions, in which the learned 
function is represented in the form of a decision tree. Decision 
trees are supervised algorithms which recursively partition 
the data, based on its attributes, until some stopping 
condition is reached. This recursive partitioning gives rise to 
a tree-like structure. A decision tree is a tree where the non- 
leaf nodes are labeled with attributes. The arcs from a node 
representing the attribute A, are labeled with each of the 
possible values of the attribute A. The leaves of the tree are 
labeled with classifications. Decision trees classify instances 
by sorting them down the tree from the root to some leaf 
node, which provides the classification of the instance 
(Mitchell, 1997) [30]. An instance is initially classified by 
starting at the root node of the tree, testing the attribute 
specified by this node, then moving down the tree branch 
corresponding to the value of the attribute. 
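
The root-to-leaf classification described above can be sketched as follows. Since the features used in this work are numeric, the example tests each attribute against a threshold, which is how continuous attributes are usually handled in decision trees; the `Node` layout, the feature names and the threshold values are hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: Optional[str] = None       # class label, set on leaf nodes only
    feature: Optional[str] = None     # attribute tested at an internal node
    threshold: float = 0.0            # split point for the tested attribute
    left: Optional["Node"] = None     # branch taken when value <= threshold
    right: Optional["Node"] = None    # branch taken when value > threshold


def classify(node, instance):
    """Sort an instance down the tree from the root to a leaf."""
    while node.label is None:
        value = instance[node.feature]
        node = node.left if value <= node.threshold else node.right
    return node.label


# A toy tree with made-up thresholds, for illustration only.
toy_tree = Node(
    feature="pmi", threshold=1.0,
    left=Node(label="not a MWE"),
    right=Node(feature="tscore", threshold=0.5,
               left=Node(label="not a MWE"),
               right=Node(label="MWE")),
)
```

Calling `classify(toy_tree, {"pmi": 2.0, "tscore": 0.9})` follows the right branch twice and returns the leaf label "MWE".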

The most important feature of a decision tree classifier is 
its capability to break down a complex decision-making 
process into a collection of simpler decisions, thus providing 
a solution which is often easier to interpret (Safavian et al., 
1991) [31]. 

C4.5 is an algorithm developed by Ross Quinlan 
(Quinlan, 1993) [32] for generating a decision tree. It is an 
extension of Quinlan's earlier ID3 algorithm 
(Quinlan, 1986) [33]. For our MWE identification task, the 
C4.5 decision tree is trained to classify the candidate MWEs 
in a document into one of two categories: "MWE" and "not a 
MWE". 

Training a decision tree learning algorithm for MWE 
identification requires candidate MWEs to be represented 
as feature vectors. For this purpose, we wrote a computer 
program that automatically extracts values for the features 
characterizing the MWE candidates in the documents. For 
each candidate MWE in a document in our corpus, we extract 
the values of the features of the candidate using the measures 
discussed in subsection C of Section III. If the candidate 
MWE is found in the list of manually identified MWEs, we 
label it as a "positive" example; otherwise we label it as a 
"negative" example. Thus the feature vector for each 
candidate looks like {<a1, a2, a3, ..., an>, <label>}, which 
becomes a training instance (example) for the decision tree, 
where a1, a2, ..., an indicate the feature values for a 
candidate. A training set consisting of a set of instances of 
the above form is built up by running a computer program on 
the documents in our corpus. 

For our experiment, we use the Weka 
(www.cs.waikato.ac.nz/ml/weka) machine learning tools. We 
use J48, the WEKA implementation of the C4.5 decision tree, 
which is included under the Classifier/trees panel of the 
WEKA workbench. The J48 classifier was run with the 
default values of its parameters. 
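
One way to prepare such training instances for WEKA is to label each candidate against the gold list and write the result in ARFF, the plain-text dataset format that WEKA reads. The sketch below is an assumption about how this could be done, not the authors' program; the function names, the relation name and the class labels "MWE" and "nonMWE" are hypothetical.

```python
def build_instances(candidates, features, gold):
    """Pair each candidate's feature values with a class label.

    candidates: list of bigrams; features: dict mapping a bigram to its
    list of feature values; gold: set of manually identified MWEs.
    """
    return [(features[c], "MWE" if c in gold else "nonMWE")
            for c in candidates]


def to_arff(instances, feature_names, relation="mwe_candidates"):
    """Serialize labeled feature vectors as ARFF text."""
    lines = ["@relation " + relation, ""]
    lines += ["@attribute %s numeric" % name for name in feature_names]
    lines.append("@attribute class {MWE,nonMWE}")
    lines += ["", "@data"]
    for values, label in instances:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"
```

The returned text can be saved to a file with the `.arff` extension and opened in the WEKA workbench, where J48 is then selected as the classifier.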

IV. Evaluation And Results 

For evaluating the performance of our system the 
traditional precision, recall and F-measure are computed by 
comparing machine assigned labels to the human assigned 
labels for the candidate MWEs extracted from our corpus of 
274 Bengali documents. 

A. Experimental dataset 

Our corpus was created by collecting news articles from 
the online version of the well-known Bengali newspaper 
ANANDABAZAR PATRIKA during the period from 
20.09.2012 to 19.10.2012. The news articles published online 
under the sections Rajya (State) and Desh on the topics 
bandh-dharoghat, crime, disaster, jongi, mishap, political and 
miscellaneous are included in the corpus. It consists of a total 
of 274 documents, which contain 18769 lines of Unicode 
text and 233430 tokens. We manually identified 
all the noun-noun compound MWEs in the collection and 
created a gold standard for labeling the training data. It 
consists of 4641 noun-noun compound MWEs. A total of 8210 
noun-noun compound MWE candidates are automatically 
extracted using the chunker and the heuristic rules described 
in subsection B of Section III. 

B. Results 

To estimate the overall accuracy of our proposed MWE 
identification system, 10-fold cross validation is done. The 
dataset is randomly reordered and then split into n parts of 
equal size (here n = 10). For each of the 10 iterations, one 
part is used for testing and the other n-1 parts are used for 
training the classifier. The test results are collected and 
averaged over all folds. This gives the cross-validation 
estimate of the accuracy of the proposed system. J48, the 
decision tree learner included in WEKA, is used as a single 
decision tree for implementing our system. Our proposed 
decision tree based system gives an average F-measure of 
0.77. 
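
The evaluation procedure can be sketched as below: a random split into k folds, plus precision, recall and F-measure for the positive class. This is an illustrative sketch with hypothetical function names; it uses a plain random split, which may differ from how WEKA forms its folds.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # random reordering
    folds = [indices[i::k] for i in range(k)]  # k parts of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def precision_recall_f(gold, predicted, positive="MWE"):
    """Precision, recall and F-measure for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

In each fold the classifier would be trained on the `train` indices, its predictions on the `test` indices scored with `precision_recall_f`, and the per-fold scores averaged.
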

V. Conclusions And Future Work 

This paper presents a machine learning based approach 
for identifying noun-noun compound MWEs in a Bengali 
corpus. We have used a number of association measures as 
features, which are combined by a decision tree learning 
algorithm for recognizing noun-noun compounds. 

As future work, we plan to improve the candidate MWE 
extraction step of the proposed system and/or introduce 
new features such as lexical features and semantic features. 